
    Overview of the 2nd international competition on plagiarism detection

    This paper overviews 18 plagiarism detectors that have been developed and evaluated within PAN'10. We start with a unified retrieval process that summarizes the best practices employed this year. Then, the detectors' performances are evaluated in detail, highlighting several important aspects of plagiarism detection, such as obfuscation, intrinsic vs. external plagiarism, and plagiarism case length. Finally, all results are compared to those of last year's competition.

    Overview of the 1st international competition on plagiarism detection

    The 1st International Competition on Plagiarism Detection, held in conjunction with the 3rd PAN workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, brought together researchers from many disciplines around the exciting retrieval task of automatic plagiarism detection. The competition was divided into the subtasks external plagiarism detection and intrinsic plagiarism detection, which were tackled by 13 participating groups. An important by-product of the competition is an evaluation framework for plagiarism detection, which consists of a large-scale plagiarism corpus and detection quality measures. The framework may serve as a unified test environment to compare future plagiarism detection research. In this paper we describe the corpus design and the quality measures, survey the detection approaches developed by the participants, and compile the achieved performance results of the competitors.
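
    The detection quality measures mentioned above combine precision and recall with a penalty for fragmented detections; a minimal sketch of such a PAN-style combined score, assuming precision, recall, and granularity (how many detections cover one true case, at least 1) are already computed — function name and inputs are illustrative:

```python
import math

def plagdet(precision: float, recall: float, granularity: float) -> float:
    # PAN-style overall score: F1 discounted by a granularity penalty.
    # granularity >= 1; a value of 1 means each plagiarism case is
    # detected as exactly one contiguous detection.
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

# With perfect granularity the score equals plain F1.
print(round(plagdet(0.8, 0.6, 1.0), 4))  # → 0.6857
```

A detector that finds the same text but splits each case into several fragments (granularity > 1) is penalized, which rewards coherent case-level detections over scattered matches.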

    Overview of the 3rd international competition on plagiarism detection

    This paper overviews eleven plagiarism detectors that have been developed and evaluated within PAN'11. We survey the detection approaches developed for the two sub-tasks "external plagiarism detection" and "intrinsic plagiarism detection," and we report on their detailed evaluation based on the third revised edition of the PAN plagiarism corpus, PAN-PC-11.

    Overview of the CLEF-2019 CheckThat! Lab: Automatic identification and verification of claims. Task 2: Evidence and factuality

    We present an overview of Task 2 of the second edition of the CheckThat! Lab at CLEF 2019. Task 2 asked (A) to rank a given set of Web pages with respect to a check-worthy claim based on their usefulness for fact-checking that claim, (B) to classify these same Web pages according to their degree of usefulness for fact-checking the target claim, (C) to identify useful passages from these pages, and (D) to use the useful pages to predict the claim's factuality. Task 2 at CheckThat! provided a full evaluation framework, consisting of data in Arabic (gathered and annotated from scratch) and evaluation based on normalized discounted cumulative gain (nDCG) for ranking and F1 for classification. Four teams submitted runs. The most successful approach to subtask A used learning-to-rank, while different classifiers were used in the other subtasks. We release to the research community all datasets from the lab as well as the evaluation scripts, which should enable further research in the important task of evidence-based automatic claim verification.
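
    The ranking evaluation above relies on normalized discounted cumulative gain; a minimal sketch of nDCG over graded usefulness labels (the relevance values below are invented for illustration, not from the lab data):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each graded relevance is discounted
    # by the log of its 1-based rank position.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # Normalize by the DCG of the ideal (descending) ordering, so a
    # perfect ranking scores 1.0.
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Gold usefulness grades of three pages, in the order a system ranked them.
print(round(ndcg([2, 0, 1]), 4))  # → 0.9502
```

Swapping the second and third pages here would restore the ideal ordering and yield 1.0, which is why nDCG suits the "rank pages by usefulness for fact-checking" subtask.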

    Dense vs. Sparse representations for news stream clustering

    The abundance of news being generated on a daily basis has made it hard, if not impossible, to monitor all news developments. Thus, there is an increasing need for accurate tools that can organize the news for easier exploration. Typically, this means clustering the news stream and then connecting the clusters into story lines. Here, we focus on the clustering step, using a local topic graph and a community detection algorithm. Traditionally, news clustering was done using sparse vector representations with TF–IDF weighting, but more recently dense representations have emerged as a popular alternative. Here, we compare these two representations, as well as combinations thereof. The evaluation results on a standard dataset show a sizeable improvement over the state of the art both for the standard F1 and for a BCubed version thereof, which we argue is more suitable for the task.
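
    The sparse-versus-dense contrast above ultimately comes down to how document similarity is computed under each representation; a toy sketch of cosine similarity over a sparse TF–IDF-style term-weight dict and over a dense vector (the terms and weights are made up for illustration, not the paper's pipeline):

```python
import math

def cosine_sparse(a: dict, b: dict) -> float:
    # Cosine similarity over sparse term -> weight dicts (e.g., TF-IDF);
    # only terms present in both documents contribute to the dot product.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cosine_dense(a: list, b: list) -> float:
    # The same similarity over dense embedding vectors, where every
    # dimension contributes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Two toy news snippets sharing only the term "election".
d1 = {"election": 1.2, "vote": 0.8}
d2 = {"election": 1.0, "poll": 0.5}
print(round(cosine_sparse(d1, d2), 4))  # → 0.7442
```

Sparse similarity is zero unless documents share surface terms, while dense embeddings can score paraphrases as similar — the trade-off the comparison above investigates.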

    Preface

    In the Iberian Peninsula, five official languages co-exist: Basque, Catalan, Galician, Portuguese and Spanish. Fostering multi-linguality and establishing strong links among the linguistic resources developed for each language of the region is essential. Additionally, some of these languages lack published resources, which fosters a strong inter-relation between them and higher-resourced languages, such as English and Spanish.
    In order to favour the intra-relation among the peninsular languages as well as the inter-relation between them and foreign languages, multilingual NLP tools for different purposes need to be developed. Interesting topics to be researched include, among others, analysis of parallel and comparable corpora, development of multilingual resources, and language analysis in bilingual environments and within dialectal variations. With the aim of solving these tasks, statistical, linguistic and hybrid approaches are proposed. Therefore, the workshop addresses researchers from different fields of natural language processing/computational linguistics: text mining, machine learning, pattern recognition, information retrieval and machine translation.
    The research in these proceedings includes work in all of the official languages of the Iberian Peninsula; interactions with English are also included. Wikipedia has been shown to be an interesting resource for different tasks and has been analysed or exploited in some contributions.
    Most of the regions of the Peninsula are represented by the authors of the contributions. The distribution is as follows: Basque Country (2 authors), Catalonia (7 authors), Galicia (4 authors), Portugal (2 authors) and Valencia (5 authors). Interestingly, those regions where Spanish is the only official language are not represented. It is worth noting that authors working beyond the Peninsula have also contributed to this workshop, including: Argentina (3 authors), Finland (1 author), France (2 authors), Mexico (1 author), Singapore (1 author), and USA (6 authors).

    Qlusty: Quick and dirty generation of event videos from written media coverage

    Qlusty automatically generates videos describing the coverage of the same event by different news outlets. Through four modules, it identifies events, de-duplicates notes, ranks them according to coverage, and queries for images to generate an overview video. In this manuscript we present our preliminary models, including quantitative evaluations of the former two modules and a qualitative analysis of the latter two. The results show the potential for achieving our main aim: contributing to breaking the information bubble so common in the current news landscape.

    Preface

    These proceedings contain the papers of the Third International Workshop on Recent Trends in News Information Retrieval (NewsIR'19), held in conjunction with the ACM SIGIR 2019 conference in Paris, France, on the 25th of July 2019. Ten full papers and two short papers (one position paper and one demo paper) were selected by the programme committee from a total of 21 submissions. Each submitted paper was reviewed by at least three members of an international programme committee. In addition to the selected papers, the workshop features one keynote and one invited talk. The keynote speech is given by Aron Pilhofer: "From Redlining to Robots: How newsrooms apply technology to the craft of journalism". The invited talk is given by Friedrich Lindenberg: "Mining Leaks and Open Data to Follow the Money". We would like to thank SIGIR for hosting us. Thanks also go to the keynote speakers, the program committee, the paper authors, and the participants, for without these people there would be no workshop.

    Prta: A System to Support the Analysis of Propaganda Techniques in the News

    Recent events, such as the 2016 US Presidential Campaign, Brexit and the COVID-19 "infodemic", have brought into the spotlight the dangers of online disinformation. There has been a lot of research focusing on fact-checking and disinformation detection. However, little attention has been paid to the specific rhetorical and psychological techniques used to convey propaganda messages. Revealing the use of such techniques can help promote media literacy and critical thinking, and eventually contribute to limiting the impact of "fake news" and disinformation campaigns. Prta (Propaganda Persuasion Techniques Analyzer) allows users to explore the articles crawled on a regular basis by highlighting the spans in which propaganda techniques occur and to compare them on the basis of their use of propaganda techniques. The system further reports statistics about the use of such techniques, overall and over time, or according to filtering criteria specified by the user based on time interval, keywords, and/or political orientation of the media. Moreover, it allows users to analyze any text or URL through a dedicated interface or via an API. The system is available online: https://www.tanbih.org/prta

    Thread-level information for comment classification in community question answering

    Community Question Answering (cQA) is a new application of QA in social contexts (e.g., fora). It presents new interesting challenges and research directions, e.g., exploiting the dependencies between the different comments of a thread to select the best answer for a given question. In this paper, we explored two ways of modeling such dependencies: (i) by designing specific features looking globally at the thread; and (ii) by applying structured prediction models. We trained and evaluated our models on data from SemEval-2015 Task 3 on Answer Selection in cQA. Our experiments show that: (i) the thread-level features consistently improve the performance for a variety of machine learning models, yielding state-of-the-art results; and (ii) sequential dependencies between the answer labels captured by structured prediction models are not enough to improve the results, indicating that more information is needed in the joint model.